AD-A145  131 
UNCLASSIFIED 




RESEARCH 
CENTER


ASSESSING  THE  RELIABILITY  OF  COMPUTER  SOFTWARE 
AND  COMPUTING  NETWORKS:  AN  OPPORTUNITY  FOR 
PARTNERSHIP  WITH  COMPUTER  SCIENTISTS 

by 

Richard  E.  Barlow+  and  Nozer  D.  Singpurwalla++ 


ORC  84-7  July  1984 


+ Department  of  Industrial  Engineering  and  Operations  Research, 
University  of  California,  Berkeley,  California. 

++ Department of Operations Research, The George Washington University,
Washington,  D.C. 

This  research  has  been  partially  supported  by  the  Air  Force  Office  of 
Scientific Research (AFSC), USAF, under Grant AFOSR-81-0122 with the
University  of  California,  the  Office  of  Naval  Research  under  Contract 
N00014-77-C-0263,  and  the  Naval  Air  Systems  Command  under  Contract 
N000 19-8^8^0358  with  The  George  Washington  University.  Reproduction 
in  whole  or  in  part  is  permitted  for  any  purpose  of  the  United  States 
Government. 


Unclassified

REPORT DOCUMENTATION PAGE (DD Form 1473)

1.  REPORT NUMBER: ORC 84-7
4.  TITLE (and Subtitle): ASSESSING THE RELIABILITY OF COMPUTER SOFTWARE
    AND COMPUTING NETWORKS: AN OPPORTUNITY FOR PARTNERSHIP WITH
    COMPUTER SCIENTISTS
5.  TYPE OF REPORT & PERIOD COVERED: Technical Report
7.  AUTHOR(s): Richard E. Barlow and Nozer D. Singpurwalla
8.  CONTRACT OR GRANT NUMBER(s): AFOSR-81-0122
9.  PERFORMING ORGANIZATION NAME AND ADDRESS: Operations Research Center,
    University of California, Berkeley, California 94720
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: 2304/A5
11. CONTROLLING OFFICE NAME AND ADDRESS: United States Air Force,
    Air Force Office of Scientific Research, Bolling Air Force Base,
    D.C. 20332
12. REPORT DATE: July 1984
13. NUMBER OF PAGES: 10
15. SECURITY CLASS. (of this report): Unclassified
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release;
    distribution unlimited.
19. KEY WORDS: Computer Reliability; Software Reliability;
    Network Reliability
20. ABSTRACT: (SEE ABSTRACT)


ASSESSING  THE  RELIABILITY  OF  COMPUTER  SOFTWARE  AND  COMPUTING  NETWORKS: 
AN  OPPORTUNITY  FOR  PARTNERSHIP  WITH  COMPUTER  SCIENTISTS 


Richard E. Barlow* and Nozer D. Singpurwalla†

*University of California, Berkeley, California;
†The George Washington University, Washington, D.C.

In this paper, we highlight three areas which are of importance in computer science,
and in which statisticians can make valuable contributions. We outline these areas,
survey the developments, and point out some of the open problems.


1.  INTRODUCTION 

An aspect of computer science which appears to
be largely overlooked by statisticians is that
which pertains to assessing the quality of a
piece of software, or the trustworthiness of
computer hardware and computing networks. Much
of the work in this research area is probabilistic
in nature, but is undertaken by engineers
and is published in the engineering literature.
However, there do remain many unsolved problems
which statisticians could help solve and whose
solution could influence the state of the art
of computer science, especially as it relates
to the design of hardware, the configuration of
a network of computers, and policies of software
development. Thus, these problems present
a good opportunity for statisticians to develop
a genuine partnership with computer scientists.
This is in contrast with the present situation
wherein statisticians are using the fruits of
computer technology but are not working on the
statistical problems presented by computer technology.
It appears that much of the previous
activity classified as the interface between
computer science and statistics has focused on
the above aspect. In this paper we hope to whet
the statisticians' appetite by highlighting some
of the issues that would constitute an additional
dimension to the interface between computer
science and statistics. We point out here three
research areas: software reliability, the reliability
of fault-tolerant computers, and the reliability
of computer networks.

2.  SOFTWARE  RELIABILITY 

As  a  result  of  the  rapid  technological  advances 
in  microelectronics  and  microprocessors,  problems 
of  reliability  have  changed  from  those  involving 
hardware  to  those  involving  software.  According 
to  recent  estimates,  software  costs  amount  to 
about 60-80% of total system costs, and software
failures  account  for  some  of  the  major  difficul¬ 
ties  in  the  operation  of  commercial  and  military 
systems.  Software  has  become  a  critical  compo¬ 
nent  for  the  navigation  and  control  of  aerospace 
systems,  for  the  successful  use  of  medical  care 
systems,  for  systems  of  interest  to  national 
defense, and for communication systems. The
failure of such systems typically results in
some  serious  economic  and  strategic  conse¬ 
quences,  and  it  is  now  a  well  noted  fact  that 
software  failures  are  often  the  primary  cause 
of  system  failure.  To  cite  an  example,  it  was 
the  failure  of  software  that  caused  the  post¬ 
poning  of  a  recent  mission  of  the  Space  Shuttle. 

Considerations  of  the  above  type  have  given 
birth  to  the  subject  of  software  engineering  of 
which  the  assessment  of  software  quality  is  an 
important element. Large corporations like
A.T.&T.'s Bell Laboratories, the International
Business Machines Corporation, British Aerospace,
and Government Laboratories such as NASA
at Goddard and Langley, have instituted major
efforts  directed  towards  the  study  of  software 
quality  and  software  reliability.  Software 
reliability  is  a  measure  of  software  quality; 
it  is  the  probability  of  failure  free  operation 
of  the  software,  in  a  specified  environment, 
for  a  specified  period  of  time.  The  quantifi¬ 
cation  and  the  measurement  of  software  reli¬ 
ability,  an  assessment  of  the  changes  in  soft¬ 
ware  reliability  over  time  (reliability  growth), 
the  analysis  of  software  failure  data,  and  the 
decision  of  whether  to  continue  or  to  stop  test¬ 
ing  software,  are  issues  which  the  statistician 
is  well  qualified  to  address. 

2.1  Models  for  Describing  Software  Reliability 

To facilitate answering some of the above issues,
several probabilistic models which describe
the stochastic behavior of software running
times have been proposed in the literature.
A good model can help program developers estimate
the time (and effort) required to achieve
a desired level of reliability. Features which
distinguish  software  reliability  models  from 
hardware  reliability  models  stem  from  the  fact 
that  software,  unlike  hardware,  does  not  age 
with  time,  and  that  software  failures  are  due 
to  human  errors  of  design  and  logic.  In  con¬ 
trast  to  physical  failures,  the  problems  of 
man-made  faults  have  not  been  suddenly  allevi¬ 
ated  by  a  technological  breakthrough  similar  to 
the  invention  of  semiconductor  and  ferrite  core 
components.  Consequently,  once  a  software  error 
is detected and diagnosed, it is corrected
forever, with the net result that after a long
span  of  time  the  software  tends  to  be  nearly 
perfect.  Furthermore,  whereas  hardware  can  be 
replicated  (made  redundant)  to  enhance  reli¬ 
ability,  software  replication  just  replicates 
the  bugs.  In  what  follows,  we  describe  a  sim¬ 
ple  "shock  model"  for  software  failures  which 
reflects  some  of  the  above  features.  The  model, 
when  generalized,  encompasses  many  of  the  other 
models  which  have  been  proposed. 

Let $N^*$ be the (known) number of distinct "input
types" to the software; $N^*$ is assumed large relative
to $N$, the number of input types which
result in software failures. $N$ is assumed unknown.
Let us assume that the processing of an
input is instantaneous, and that once failure
occurs, errors in the logic or the coding of the
program are corrected, with the net result that
$N$ is reduced by 1. If inputs to the software
occur according to the postulates of a Poisson
process with intensity $\omega$, then given $N^*$ and $N$,
the probability of no failure in time $[0, t)$ is

$$\sum_{j=0}^{\infty} \frac{e^{-\omega t} (\omega t)^j}{j!} \left( \frac{N^* - N}{N^*} \right)^{j} = e^{-\frac{N}{N^*} \omega t}. \qquad (2.1)$$


This  is  the  familiar  shock  model  for  failures. 

Let $T_i$ denote the time between the $(i-1)$-st and the
$i$-th failure, $i \le N$. Given $N^*$ and $N$, the survival
function of $T_i$ is clearly

$$\bar{F}_{T_i}(t \mid N, N^*, \omega) = \exp\left(-\omega (N - i + 1)\, t / N^*\right), \quad t \ge 0. \qquad (2.2)$$

The  above  model  is  due  to  Langberg  and 
Singpurwalla  (1984)  who  also  show  that  the  sev¬ 
eral  commonly  used  models  for  describing  soft¬ 
ware  failures  are  special  versions  of  (2.2). 

For example, if we let $\Lambda = \omega / N^*$, then the model
of Jelinski and Moranda (1972) would result.

From (2.2) above we note that the failure rate
of $T_i$ is constant in time, and that the failure
rate of $T_i$ is greater than the failure rate of
$T_{i+1}$, $i = 1, 2, \ldots, N$; this latter property describes
reliability growth.
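
The decreasing failure rates implied by (2.2) can be illustrated numerically. The following is a minimal sketch, assuming illustrative values for $N$ and $\Lambda = \omega / N^*$; the function names are hypothetical and are not taken from the cited papers.

```python
import random


def jm_failure_rates(n_faults, lam):
    """Failure rate of T_i under (2.2) with Lambda = omega / N*: Lambda * (N - i + 1)."""
    return [lam * (n_faults - i + 1) for i in range(1, n_faults + 1)]


def simulate_interfailure_times(n_faults, lam, rng):
    """Draw T_1, ..., T_N as independent exponentials with the rates above."""
    return [rng.expovariate(rate) for rate in jm_failure_rates(n_faults, lam)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    rates = jm_failure_rates(n_faults=5, lam=0.02)
    print("failure rates:", rates)
    print("mean times between failures:", [1.0 / r for r in rates])
    print("one simulated history:", simulate_interfailure_times(5, 0.02, rng))
```

Each correction lowers the rate by $\Lambda$, so the expected time to the next failure lengthens after every fix, which is the reliability growth property noted above.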

Since $N$ and $\Lambda$ are unknown, we should assess
meaningful prior distributions for them. Suppose
that the prior distribution for $N$ is
Poisson with mean $\theta$, and let us assume, for the
time being, that $\Lambda = \lambda$ is known. Then Langberg
and Singpurwalla also show that $M(t)$, the cumulative
number of failures in $[0, t)$, can be described
by a nonhomogeneous Poisson process with
$E\,M(t) = \theta (1 - e^{-\lambda t})$. This is precisely the model
considered by Goel and Okumoto (1979). If on
the other hand $N = n$ were assumed known and the
prior distribution of $\Lambda$ were a gamma distribution,
then the model considered by Littlewood
and Verrall (1973) would result. The choice of
other prior distributions for $N$ and $\Lambda$, and of joint
prior distributions for $N$ and $\Lambda$, would result in
other probability models for describing software
failures. A generalization of (2.1) resulting
from the assumption that the intensity
$\omega$ is a function of time would lead to still
other models for software reliability. For example,
the periodic clustering of software failures
observed by Crow and Singpurwalla (1984)
suggests the possibility of assuming a cyclical
intensity function for the underlying Poisson
process. Other models for describing the discovery
and the correction of errors, the latter
being possibly imperfect, are due to Claudio
and Kaspersen (1981) and Kremer (1983). These
involve the use of birth and death processes.
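
The link between a Poisson prior on $N$ and the nonhomogeneous Poisson process mean $\theta (1 - e^{-\lambda t})$ can be checked by simulation. The sketch below is only an illustration under assumed parameter values; the helper names are hypothetical, and the Poisson sampler is Knuth's multiplication method, chosen here to avoid any dependency outside the standard library.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def failures_by_time(t, theta, lam, rng):
    """One replication: draw N ~ Poisson(theta), then count failures in [0, t)."""
    n = sample_poisson(theta, rng)
    clock, count = 0.0, 0
    for i in range(1, n + 1):
        clock += rng.expovariate(lam * (n - i + 1))  # T_i has rate lam * (N - i + 1)
        if clock >= t:
            break
        count += 1
    return count


def mean_failures(t, theta, lam, reps=20000, seed=1):
    """Monte Carlo estimate of E M(t)."""
    rng = random.Random(seed)
    return sum(failures_by_time(t, theta, lam, rng) for _ in range(reps)) / reps


if __name__ == "__main__":
    theta, lam, t = 10.0, 0.5, 2.0
    print("simulated E M(t):", mean_failures(t, theta, lam))
    print("theta * (1 - exp(-lam * t)):", theta * (1.0 - math.exp(-lam * t)))
```

The two printed values should agree to within Monte Carlo error, since the failure epochs behave like the order statistics of $N$ independent exponential detection times.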

Despite  the  above  efforts,  the  development  of 
new  models  for  more  realistically  describing 
the  stochastic  nature  of  software  failures  is 
an  actively  ongoing  effort  in  which  statisti¬ 
cians  can  make  valuable  contributions. 

2.2 Statistical Inference, Predictive Distributions
and Stopping Rules

The  fact  that  the  maximum  likelihood  estimator 
of  the  parameter  N  in  the  model  (2.2)  can  be 
highly  misleading  and  often  nonsensical,  has 
been  noted  by  Forman  and  Singpurwalla  (1977). 
That  such  a  thing  can  happen  for  any  model  des¬ 
cribing  reliability  growth  has  been  pointed  out 
by Meinhold and Singpurwalla (1983) who advocate
a Bayesian approach. Thus, for example,
for any $k \le N$, and any $t^{(k)} = (t_1, \ldots, t_k)$,
where $t_i$ is a realization of $T_i$, if we assume
that the prior distribution of $N$ is a Poisson
with a parameter $\theta$, and the prior of $\Lambda$ is a
gamma with scale parameter $\mu$ and shape parameter
$\alpha$, independent of the distribution of $N$,
then for any $q \ge k$, the posterior probability
for $N = q$ and $\Lambda = \lambda$ is

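$$P\{N = q,\ \Lambda = \lambda \mid t^{(k)}\} = \frac{\dfrac{q!}{(q-k)!}\, P\{N = q\}\, \lambda^{\alpha + k - 1} \exp\left[-\lambda \left(\mu + \sum_{j=1}^{k} (q - j + 1)\, t_j\right)\right]}{\Gamma(\alpha + k) \displaystyle\sum_{r=k}^{\infty} \frac{r!}{(r-k)!}\, P\{N = r\} \left[\mu + \sum_{j=1}^{k} (r - j + 1)\, t_j\right]^{-(\alpha + k)}}$$

where $P\{N = j\} = e^{-\theta} \theta^{j} / j!$.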


The above result enables us to make inferences
about the parameters $N$ and $\Lambda$ of the Jelinski-Moranda
model, from which inferences about $T_{k+1}$
(i.e., its predictive distribution) can be
made.
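
As an illustration of how such a posterior might be evaluated numerically, the sketch below computes the marginal posterior of $N$ for the Jelinski-Moranda model. It assumes a Poisson($\theta$) prior on $N$ and an independent gamma prior on $\Lambda$ with density proportional to $\lambda^{\alpha - 1} e^{-\mu \lambda}$; integrating $\lambda$ out then gives weights proportional to $P\{N = q\}\, \frac{q!}{(q-k)!} \left[\mu + \sum_j (q - j + 1) t_j\right]^{-(\alpha + k)}$. The function name, the truncation point `q_max`, and the data are hypothetical.

```python
import math


def posterior_N(times, theta, alpha, mu, q_max=200):
    """Marginal posterior P{N = q | t_1..t_k} for q = k..q_max (truncated support)."""
    k = len(times)
    log_w = {}
    for q in range(k, q_max + 1):
        # mu + sum_j (q - j + 1) * t_j, the exponent multiplier of lambda
        s = mu + sum((q - j + 1) * t for j, t in enumerate(times, start=1))
        log_prior = -theta + q * math.log(theta) - math.lgamma(q + 1)
        log_falling = math.lgamma(q + 1) - math.lgamma(q - k + 1)  # log q!/(q-k)!
        log_w[q] = log_prior + log_falling - (alpha + k) * math.log(s)
    top = max(log_w.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(v - top) for v in log_w.values())
    return {q: math.exp(v - top) / total for q, v in log_w.items()}


if __name__ == "__main__":
    observed = [1.0, 2.0, 4.0]  # illustrative interfailure times
    post = posterior_N(observed, theta=5.0, alpha=2.0, mu=1.0)
    for q in range(3, 9):
        print(q, round(post[q], 4))
```

Working on the log scale avoids overflow in the factorial terms; the truncation at `q_max` is a numerical convenience and should be chosen well beyond the bulk of the prior mass.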




In practice, it may sometimes happen that the
process of testing and correcting a piece of
software does not necessarily lead to its improvement.
New errors may be introduced into
the program during the act of correcting a previously
discovered error. Thus one is often interested
in ascertaining if the reliability of
the software is indeed improving over time; that
is, testing for reliability growth. Knowledge
of this kind may influence existing policies of
coding and programming, such as team programming,
modularized programming, etc. Kyparisis
and Singpurwalla (1984) propose a nonhomogeneous
Poisson process model with a Weibull intensity
function, for testing for reliability growth,
and for obtaining the predictive distribution of
the next time to failure. A Bayes Empirical
Bayes approach involving the use of a simple
power law model for reliability growth has been
proposed by Horigome, Soyer and Singpurwalla
(1984). Whereas such approaches do produce
meaningful  results  when  applied  to  certain  sets 
of  actual  failure  data,  they  are  lacking  when 
applied  to  some  other  sets  of  data.  Thus  the 
need  for  some  additional  statistical  technology 
for  assessing  software  reliability  growth  does 
exist,  and  here  again  statisticians  can  make 
further  contributions. 
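
To make the idea of a Weibull (power law) intensity concrete, the following sketch fits a nonhomogeneous Poisson process with intensity $a\, b\, t^{b-1}$ by maximum likelihood. This is the standard closed-form estimator for failure times observed on a fixed interval, not the Bayesian procedures of the papers cited above; the function name and the data are hypothetical.

```python
import math


def power_law_mle(failure_times, t_end):
    """MLE of (a, b) for intensity a * b * t**(b - 1), observed on (0, t_end]."""
    n = len(failure_times)
    b = n / sum(math.log(t_end / t) for t in failure_times)
    a = n / t_end ** b
    return a, b


if __name__ == "__main__":
    # Illustrative cumulative failure times with lengthening gaps.
    times = [10.0, 30.0, 70.0, 150.0, 310.0]
    a_hat, b_hat = power_law_mle(times, t_end=400.0)
    print("a:", a_hat, "b:", b_hat)
    print("reliability growth indicated:", b_hat < 1.0)
```

An estimate of $b$ below 1 corresponds to a decreasing intensity, that is, to reliability growth; a value above 1 would suggest that corrections are making matters worse.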

A  final  issue  pertaining  to  software  quality 
that  we  mention  here,  and  one  that  is  unresolved, 
is  the  one  which  addresses  the  question  of  when 
to  stop  testing  the  software  and  release  it  for 
its intended use. For lack of a better term,
we  refer  to  this  as  the  stopping  rule  problem, 
and  invite  the  attention  of  statisticians  to 
consider  this  issue,  because  it  is  only  they  who 
are  qualified  to  come  to  serious  grips  with  it. 
Some  preliminary  work  by  Forman  and  Singpurwalla 
(1979) has been undertaken, but is less than
satisfactory  because  it  lacks  a  proper  balancing 
between  the  risk  of  software  failure  and  the 
added  "information"  gained  by  additional  testing 
of  the  software. 

3.  THE  RELIABILITY  OF  FAULT-TOLERANT  COMPUTERS 

Whereas software reliability problems have attracted
the attention of some statisticians, input
from statisticians addressing issues pertaining
to the reliability of fault-tolerant
computers  is  virtually  non-existent.  A  fault- 
tolerant  computer  is  a  complex  system  whose  key 
features  are  the  automatic  detection,  diagnosis, 
and  correction  of  errors  (faults).  The  system 
is  therefore  necessarily  dynamic,  and  assessing 
its  reliability  is  a  non-trivial  matter,  both 
from  a  conceptual  as  well  as  a  computational 
point  of  view.  Valuable  contributions  can  be 
made  by  statisticians  for  modelling  and  evalu¬ 
ating  the  operation  of  such  important  and  fast 
evolving  systems.  In  what  follows,  we  shall 
introduce  the  notion  of  fault-tolerance  and  its 
implementation  in  designing  computer  systems  of 
the  future.  In  the  sequel,  we  also  hope  to 
highlight some of the issues of interest (to
statisticians),  that  such  systems  pose. 


3.1  Fault-Tolerance  and  Fault-Tolerant  Computers 

Fault-tolerance is an attribute of a digital
system that keeps the logic machine doing its
specified tasks when its host, the physical system,
suffers various kinds of failures of its
components [Avizienis (1978)]. A more general
concept of fault-tolerance also includes human
mistakes committed during software and hardware
implementation and during man/machine interaction.
In the quest for reliability, the only alternative
to fault-tolerance is fault-avoidance
or fault intolerance. This requires that the
physical and the software components of a system,
and their assembly techniques, be as nearly
perfect (failure-free) as possible. The combination
of fault-avoidance and manual maintenance
is currently the methodology employed for
reliability assurance with most computer systems.
Noteworthy among these are systems whose failure
can place human life in danger, systems whose
failure causes heavy economic penalties, and
systems which operate in environments that do
not allow access for manual maintenance, such as
spacecraft and unmanned underwater stations.

An important example involving a substantial use
of  fault-tolerant  computers  is  NASA's  efforts 
in  designing  avionics  (airplane)  computers  of 
the  future. 

Avionics  computers  of  the  1990's  will  be  self 
repairing,  and  are  so  designed  that  they  remain 
operational  despite  multiple  hardware  and  soft¬ 
ware  failures.  The  pilot  is  to  be  excluded  from 
the  flight-stabilizing  loop,  since  no  human  in¬ 
tervention  could  be  precise  enough  to  handle  the 
advanced  airframes  of  the  1990's.  The  ultra- 
reliable,  fault  tolerant  avionic  computers  will 
function  unattended,  despite  hardware  and  soft¬ 
ware  failures,  for  at  least  a  10-hour  flight. 

This super-reliability would be gained through
redundant hardware and software; faults that
occur would be counteracted automatically by
hardware and software algorithms. The Federal
Aviation Administration (FAA) has no exact reliability
requirements for aircraft systems, but
the figure that NASA uses calls for $10^{-9}$ as the
expected number of failures in a 10-hour flight
[Bernhard (1980)]. Conventional avionics computers,
by contrast, have $10^{-2}$ to $10^{-3}$ as the
expected number of failures for a 10-hour
flight. It is clear from the above that conventional
bench and field tests would take years
to perform and thus cannot be used to certify
reliability. The figure $10^{-9}$ is almost impossible
to estimate with adequate confidence by
conventional tests and methods. The FAA, however,
has adopted a reliability certification
program that emphasizes probabilistic models
and simulations of the system, techniques with
which the statistician is well acquainted.
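
A rough calculation shows why direct testing cannot certify a figure of $10^{-9}$. The sketch below assumes, purely for illustration, that each 10-hour flight fails independently with probability $p$ and asks how many consecutive failure-free flights are needed before $(1 - p)^n$ falls below $1 - c$ for a confidence level $c$. The function name and the 95% level are assumptions, not part of any FAA or NASA procedure.

```python
import math


def flights_needed(p_fail, confidence):
    """Smallest n with (1 - p_fail)**n <= 1 - confidence (zero-failure test)."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_fail))


if __name__ == "__main__":
    for p in (1e-3, 1e-9):
        n = flights_needed(p, 0.95)
        years = n * 10.0 / 8760.0  # 10-hour flights, 8760 hours per year
        print(f"p = {p:g}: {n} flights, about {years:.3g} years of continuous testing")
```

Under these assumptions the ultra-reliable target requires on the order of three billion failure-free flights, which is millions of years of continuous operation for a single test article.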

Other notions central to fault-tolerance are
error detection, error recovery, and fault recovery.
Error detection is the process of discovering
that the system is in an erroneous
state, and error recovery is undertaken to restore
the system to a satisfactory state from
an erroneous one, and thus avoid the potential
of a system failure. There are two strategies
for error recovery, the backward error recovery
and the forward error recovery. With the former,
one undoes the work which was previously
done to return the system to a previous acceptable
state. With forward error recovery, we
either ignore some or all of the work at hand
and carry on with only that which we believe
to be correct, or send compensating information
which corrects erroneous behavior. If we use
the term fault to denote the mechanistic cause
of an error, then fault recovery consists of
removing a faulty component from service and
replacing it.
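
Backward error recovery is often realized with checkpoints. The following is a minimal sketch of that idea, assuming an in-memory state and a caller-supplied acceptance test; the class and function names are hypothetical and do not describe any particular fault-tolerant system.

```python
import copy


class CheckpointedState:
    """Holds a working state plus the last state known to be acceptable."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        """Record the current state as the new recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: discard work done since the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_step(store, step, is_acceptable):
    """Apply step; roll back if it raises or leaves an unacceptable state."""
    try:
        step(store.state)
        ok = is_acceptable(store.state)  # error detection
    except Exception:
        ok = False
    if ok:
        store.commit()
    else:
        store.rollback()
    return ok


if __name__ == "__main__":
    store = CheckpointedState({"balance": 100})
    non_negative = lambda s: s["balance"] >= 0
    print(run_step(store, lambda s: s.update(balance=s["balance"] + 50), non_negative))
    print(run_step(store, lambda s: s.update(balance=s["balance"] - 500), non_negative))
    print(store.state)
```

The second step is detected as erroneous and undone, so the state returns to the value committed after the first step.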

3.2  Implementing  Fault-Tolerance 

The  fundamental  principle  by  which  fault- 
tolerance  is  achieved  in  practice  is  via  the 
provision  of  redundancy  (spares)  of  resources 
which may fail, be damaged, or cease to be accessible.
Examples of these are processors,
memory, and data. Data is particularly important
as it is useless having spare processing
power and memory, if a component failure causes
the data to be irretrievably lost. The notion
of  error  recovery  deals  with  ways  of  ensuring 
that  redundant  copies  of  data  are  available. 

Redundancy  is  of  two  types,  static  (or  parallel) 
and  dynamic  (or  stand-by).  With  the  latter, 
the  spare  resources  are  not  in  use  when  the  sys¬ 
tem  is  operating  normally  and  are  brought  into 
use  after  component  failure.  With  the  former, 
the  spare  resources  are  in  use  when  the  system 
is  running  normally,  and  they  mirror  the  opera¬ 
tion  of  the  components  they  are  to  replace. 
Dynamic  redundancy  is  usually  more  flexible, 
but  the  length  of  time  taken  to  recover  from  a 
component  failure  will  normally  be  greater  than 
for  static  redundancy.  Under  dynamic  redundan¬ 
cy,  the  reallocation  of  a  spare  resource  to  a 
failed  one,  is  usually  performed  under  program 
control.  This  may  be  a  relatively  slow  process 
and  may  cause  the  system  to  fail  because  it  does 
not  meet  some  deadline.  To  avoid  this,  a  fault 
tolerant  system  must  have  time  redundancy.  That 
is,  there  must  be  sufficient  spare  (processing) 
capacity  in  the  system,  for  reallocation  to  take 
place,  after  a  component  failure.  Time  redun¬ 
dancy  is  also  useful  in  achieving  fault  toler¬ 
ance  when  one  is  dealing  with  what  are  known  as 
transient  physical  faults  -  these  are  faults 
whose  manifestation  does  not  last  longer  than 
a  certain  maximum  time.  A  classic  example  of 
the  use  of  time  redundancy  in  dealing  with  tran¬ 
sient  faults,  is  a  communication  system  where 
packets  corrupted  by  noise  are  retransmitted. 

In  the  context  of  computers,  redundancy  can  be 
achieved  via  multiple  processors  or  multiple 
computers.  In  the  former,  we  have  many  proces¬ 
sors  communicating  via  a  shared  main  store.  In 
principle,  the  processors  can  readily  check  each 
other, share and communicate the data with each
other, and in the event of a failure, one processor
can take over the work of the other. The
processors  in  a  multi-processor  system  are  phy¬ 
sically  very  close  making  them  susceptible  to 
common  mode  failures,  or  inducing  statistical 
dependencies  among  their  life  lengths. 

In  multiple  computer  systems,  independent  com¬ 
puters  communicate  via  some  communication  medium 
and  constitute  a  network  of  computers.  The  main 
advantage  of  such  a  system  emanates  from  the 
fact  that  physical  separation  of  the  computers 
increases the capacity of the system to withstand
damage at a particular location. The difficulties
and problems of computing the reliability
of such systems are discussed in Section
4.

3.3  The  Use  of  Redundancy  for  Achieving  Fault- 
Tolerance 

A simplex (non-redundant) circuit has no protection
against component failures - the failure
of  a  component  may  lead  to  the  failure  of  the 
circuit.  If  there  is  no  mechanism  for  monitor¬ 
ing  the  circuit's  behavior,  then  the  failure  of 
the  circuit  may  not  be  immediately  apparent. 

Thus  a  simplex  circuit  is  not  fault-tolerant. 

A duplex system consists of two identical circuits
in static redundancy, and the outputs of
the two circuits are compared. The failure of
any component in either circuit will cause a
discrepancy between the outputs of the two circuits.
When this happens the system is said to
have failed, because we are not able to determine
which of the two circuits has failed. We
have failure detection but not fault tolerance.
The  reliability  of  the  duplex  system  is  less 
than  the  reliability  of  the  simplex. 

The  addition  of  a  further  redundant  circuit 
yields  the  classical  Triple  Modular  Redundant 
(TMR)  circuit.  A  failure  in  any  circuit  would 
lead  to  a  discrepancy  between  its  output  and 
that  of  the  other  two.  By  using  the  majority 
voting  rule,  the  comparing  device  selects  the 
correct  output  and  sends  this  on  to  the  next 
stage.  Thus,  we  have  achieved  fault  and  error 
detection.  The  TMR  is  one  of  the  simplest 
fault  tolerant  systems.  After  one  failure,  the 
TMR  reduces  to  a  duplex  system.  We  can  verify 
that  the  TMR  system  is  the  most  reliable  in  the 
short  term.  The  comparison  device  is  generally 
simpler  than  the  circuits,  and  so  it  is  reason¬ 
able  to  assume  that  it  is  not  likely  to  fail. 
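
The comparisons among simplex, duplex, and TMR systems can be made explicit under simple assumptions. The sketch below assumes that circuits fail independently with exponential lifetimes of common rate `lam`, that the comparison device never fails, and that there is no repair; these assumptions and the function names are illustrative only.

```python
import math


def simplex(t, lam):
    """Reliability of a single circuit at time t."""
    return math.exp(-lam * t)


def duplex(t, lam):
    """Both circuits must work for the outputs to agree."""
    r = simplex(t, lam)
    return r * r


def tmr(t, lam):
    """At least two of three circuits must work: 3r^2(1 - r) + r^3."""
    r = simplex(t, lam)
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    lam = 0.001
    crossover = math.log(2.0) / lam  # TMR beats simplex only while r > 1/2
    for t in (100.0, crossover, 1000.0):
        print(f"t = {t:7.1f}  simplex = {simplex(t, lam):.4f}  "
              f"duplex = {duplex(t, lam):.4f}  tmr = {tmr(t, lam):.4f}")
```

Under these assumptions the duplex is always less reliable than the simplex, and TMR is the most reliable of the three only for mission times shorter than $\ln 2 / \lambda$, which is the sense in which it is best "in the short term".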

We  can  continue  to  increase  the  level  of  re¬ 
dundancy  provided,  and  produce  a  quadruple  re¬ 
dundant  circuit.  We  can  also  replicate  the 
comparison  (voter)  devices  to  protect  against 
failure  of  the  voting  system.  TMR  systems  with 
dynamic  replacement  of  failed  components  to 
maintain  fault  recovery  have  been  built.  A 
disadvantage  of  TMR  systems,  is  that  the  simul¬ 
taneous  failure  of  two  or  more  circuits  due  to 
a  common  mode  (or  a  shock),  when  not  detected 
could lead to an erroneous output. The failed
circuits would constitute a majority vote and
send  an  erroneous  output  to  the  next  stage. 

3.4 Reliability Analysis of Fault-Tolerant
Computers

It should be clear from our discussion of the
triple modular redundant circuit that a collection
of several such circuits in a large computer
would quickly cascade into several stages
posing  difficult  computational  problems  for  the 
reliability  analysis  of  the  computer.  The  prob¬ 
lems  will  be  further  aggravated  if  we  have  quad¬ 
ruple  redundant  circuits  (which  are  more  reli¬ 
able  than  the  TMR  circuits)  or  if  we  replicate 
the  comparison  devices  to  ensure  further  reli¬ 
ability.  As  of  now,  no  satisfactory  method  for 
estimating  the  reliability  of  such  systems 
exists.  For  NASA's  avionics  computer,  one  has 

to  deal  with  10  states  if  the  problem  were  to 
be  approached  via  a  Markov  chain  model.  NASA 
engineers  use  the  Computer  Aided  Reliability 
Estimation  III  (CARE  III)  procedure  for  under¬ 
taking  such  analyses.  An  overview  of  CARE  III 
and  a  chronology  of  its  development  is  given  by 
Bavuso (1984). A tutorial on CARE III, which
is more detailed than Bavuso's paper, is given
by  Trivedi  and  Geist  (1981).  The  potential  for 
refinement  of  the  CARE  III  approach  exists,  in¬ 
cluding  the  possibility  of  looking  at  the  prob¬ 
lem  from  an  entirely  different  point  of  view. 

We  draw  the  attention  of  statisticians  to  this 
problem. 

4.  A  SURVEY  OF  EFFICIENT  NETWORK  RELIABILITY 
COMPUTATIONAL  METHODS 

A  classic  problem  involving  computer  networks 
is  to  calculate  the  probability  that  certain 
distinguished  nodes  (computers)  in  a  network  are 
connected.  Arcs  and/or  nodes  in  the  network  may 
fail, thus disconnecting some computer pairs.

The  difficulty  of  the  problem  arises  partly  from 
the  complexity  of  the  network.  These  problems 
are  known  to  be,  in  the  worst  case,  NP-hard. 
Hence, efficient (i.e., polynomial time) algorithms
are only likely for special structures.
However,  it  has  been  conjectured  that,  in  par¬ 
ticular,  these  problems  can  be  solved  efficient¬ 
ly  for  planar  networks.  Further  complications 
arise  when  arc  and/or  computer  success  prob¬ 
abilities  are  not  known  but  their  uncertainty 
is  assessed  in  the  form  of  a  probability  den¬ 
sity,  perhaps  based  on  arc  and/or  computer  fail¬ 
ure  data.  Although  the  problem  can,  in  prin¬ 
ciple,  be  solved  using  standard  Bayesian  proce¬ 
dures,  it  can  be  solved  in  practice  only  with 
the  aid  of  efficient  computational  algorithms. 

In  the  next  sections  we  discuss  the  problem  in 
more  depth  and  describe  the  best  known  methods 
available  at  this  date. 

4.1  The  Statistical  Problem 

Given a network such as the ARPA computer network
(circa 1974) in Figure 1, we wish to calculate
the probability that the distinguished
nodes (dark circles) can communicate with each
other. (In general, let K be the set of distinguished
nodes.) The distinguished nodes will
be connected if there is at least one K-tree,
all of whose arcs work. (A K-tree of a graph,
G, with respect to K, is any minimal graph which
connects all the distinguished nodes in K.) A
K-tree is the analogue of a minimal path for a
2-terminal network problem.

Figure 1: ARPA Computer Network, circa 1974

For  simplicity  we  will  assume  henceforth  that 
only arcs can fail. There are at least three
possible  complications: 

1.  In our judgement, arcs may not fail independently
of one another given arc reliability
parameters $p_1, p_2, \ldots, p_n$;

2.  In  our  judgement,  some  arcs  may  have  iden¬ 
tical  reliability  parameters  so  that  they 
are  a  priori  dependent  relative  to  the 
joint  prior  density  for  parameters; 

3.  The  network  may  be  topologically  very  com¬ 
plex. 

We will use the following notation. Let $X_i = 1$
if arc $i$ works and 0 otherwise. Let
$\phi(X_1, X_2, \ldots, X_n) = 1$ if the distinguished
nodes are connected and 0 otherwise. In principle,
$\phi$ can be represented in sums of products form and
our problem can be solved using this representation.
In practice, this is not feasible for large networks.
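To make the infeasibility concrete, here is a minimal sketch that computes network reliability by summing over all 2^n arc states. It assumes arcs fail independently given their reliabilities; the function names and the bridge example are hypothetical. The loop doubles in length with every added arc, which is why the direct approach fails for large networks.

```python
from itertools import product


def connected(arcs, up, K):
    """Structure function: 1 if the working arcs join all nodes of K."""
    start = next(iter(K))
    reach = {start}
    frontier = [start]
    while frontier:
        u = frontier.pop()
        for (a, b), works in zip(arcs, up):
            if works and u in (a, b):
                w = b if u == a else a      # other endpoint (undirected arc)
                if w not in reach:
                    reach.add(w)
                    frontier.append(w)
    return int(set(K) <= reach)


def reliability_by_enumeration(arcs, p, K):
    """Sum the probabilities of all arc states in which K is connected."""
    total = 0.0
    for up in product((0, 1), repeat=len(arcs)):    # 2**n states
        if connected(arcs, up, K):
            prob = 1.0
            for x, p_i in zip(up, p):
                prob *= p_i if x else 1.0 - p_i
            total += prob
    return total


# Hypothetical five-arc bridge network, distinguished nodes s and t.
BRIDGE_ARCS = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
r = reliability_by_enumeration(BRIDGE_ARCS, [0.9] * 5, {"s", "t"})
```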

A. Satyanarayana (1982) obtained the most efficient
algorithm for finding $\phi(X_1, X_2, \ldots, X_n)$
within the class of methods based on the
inclusion-exclusion principle. However, this is an
exponential time algorithm in general.

The so-called topological formula obtained is of
the form

$$\phi(X_1, X_2, \ldots, X_n) = \sum_i (-1)^{b_i - v_i + 1} \prod_{j \in G_{a,i}} X_j$$

where $G_{a,i}$ is the $i$-th acyclic K-subgraph with
$b_i$ arcs and $v_i$ vertices or nodes. A K-subgraph
is a connected directed graph for a given K with
every arc in some K-tree with respect to K. If the
original network is undirected (which is true for
the network of Figure 1) then it must be directed by
replacing each arc by two arcs in anti-parallel.
However, arcs in such anti-parallel pairs can be
assumed to fail independently of each other but have
the same reliability parameter. Independence may be
assumed since only one member of each pair will
appear in any acyclic K-subgraph. A Pascal computer
program for solving the 2-terminal problem is
available from the authors.
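For orientation, the sketch below applies plain inclusion-exclusion over minimal paths to a small 2-terminal example. This is not Satyanarayana's topological algorithm and not the authors' Pascal program; it is the naive method from the same family, shown under the assumption that arcs fail independently with known reliabilities. The function name, arc labels, and bridge network are hypothetical. The naive sum has 2^m - 1 terms for m minimal paths, many of which cancel; that redundancy is what a more careful formula aims to avoid.

```python
from itertools import combinations
from math import prod


def reliability_inclusion_exclusion(min_paths, p):
    """P(at least one minimal path has all its arcs working)."""
    total = 0.0
    for r in range(1, len(min_paths) + 1):
        for subset in combinations(min_paths, r):
            arcs = frozenset().union(*subset)   # arcs used by these paths
            # Sign alternates with the number of paths intersected.
            total += (-1) ** (r + 1) * prod(p[j] for j in arcs)
    return total


# Minimal s-t paths of a hypothetical five-arc bridge network.
BRIDGE_PATHS = [{"e1", "e4"}, {"e2", "e5"},
                {"e1", "e3", "e5"}, {"e2", "e3", "e4"}]
p_arc = {name: 0.9 for name in ("e1", "e2", "e3", "e4", "e5")}
r = reliability_inclusion_exclusion(BRIDGE_PATHS, p_arc)
```

For this example the 15 inclusion-exclusion terms collapse to 2p^2 + 2p^3 - 5p^4 + 2p^5 when every arc has reliability p.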

Let the network reliability parameter be

$$h = P[\phi(X_1, X_2, \ldots, X_n) = 1 \mid \pi(\cdot)]$$

where $\pi(\cdot)$ is the joint probability density for
arc reliability parameters
$(p_1, p_2, \ldots, p_n) \overset{\text{DEF}}{=} \mathbf{p}$.
Since $h$ will have a probability density induced by
$\pi(p_1, \ldots, p_n)$, the unconditional network
reliability will be $E[h \mid \pi(\cdot)]$. If arcs are
judged to fail independently of each other then


$$E h = E_\pi h(\mathbf{p}) = E_\pi \Bigl[ \sum_i (-1)^{b_i - v_i + 1} \prod_{j \in G_{a,i}} p_j \Bigr].$$

If, furthermore, arc reliability parameters are
judged independent, then

$$E h = E_\pi h(\mathbf{p}) = \sum_i (-1)^{b_i - v_i + 1} \prod_{j \in G_{a,i}} E(p_j).$$
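A small worked case may help fix ideas. Take a hypothetical 2-terminal network of two arcs in parallel (an illustration, not an example from the report). Its acyclic K-subgraphs are arc 1 alone, arc 2 alone (each with b = 1, v = 2, sign +1) and both arcs together (b = 2, v = 2, sign -1). The second line below uses independence of the parameters; the third shows what changes under complication 2, when both arcs share a single parameter p.

```latex
\begin{align*}
h(\mathbf{p}) &= p_1 + p_2 - p_1 p_2, \\
E_\pi h(\mathbf{p}) &= E(p_1) + E(p_2) - E(p_1)E(p_2) = h(E\mathbf{p})
  && \text{($p_1$, $p_2$ independent)}, \\
E_\pi h(\mathbf{p}) &= 2E(p) - E(p^2) = h(E\mathbf{p}) - \mathrm{Var}(p)
  && \text{($p_1 = p_2 = p$)}.
\end{align*}
```

In the shared-parameter case, evaluating h at the prior means overstates the expected reliability of this parallel pair by Var(p).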


4.2  The  Variance  of  the  Network  Reliability 
Parameter  h 

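Assume $\perp_{i=1}^{n} X_i \mid \mathbf{p}$ and
$\perp_{i=1}^{n} p_i \mid \pi(\cdot)$ where $\perp$
indicates independence. Given $\mathbf{p}$,
$h(\mathbf{p})$ is the network reliability. Since
$\mathbf{p}$ has joint density $\pi(\cdot)$,
$\mathrm{Var}_\pi[h(\mathbf{p})]$ may be of interest as
well as $E_\pi h(\mathbf{p}) = h(E\mathbf{p})$.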

$\mathrm{Var}[h]$ can be approximated by calculating
expressions of the form $h(1_I, 0_J, E\mathbf{p})$ where
$(1_I, 0_J, E\mathbf{p})$ is the $E\mathbf{p}$ vector with
the coordinates whose indices are in $I$ replaced by
1 and the coordinates whose indices are in $J$
replaced by 0.

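Using Taylor's formula for several variables, we
have

$$h(\mathbf{p}) - h(E\mathbf{p}) = \sum_{k=1}^{m} \frac{1}{k!}\, d^k h(E\mathbf{p}; \mathbf{p} - E\mathbf{p}) + \frac{1}{(m+1)!}\, d^{m+1} h(\mathbf{z}; \mathbf{p} - E\mathbf{p})$$

where

$$d^1 h(E\mathbf{p}; \mathbf{p} - E\mathbf{p}) = \sum_{i=1}^{n} [h(1_i, E\mathbf{p}) - h(0_i, E\mathbf{p})]\,(p_i - E p_i),$$

etc., and $\mathbf{z}$ is on the line segment joining
$\mathbf{p}$ and $E\mathbf{p}$. Therefore, to the first
order ($m = 1$),

$$\mathrm{Var}[h(\mathbf{p})] \approx \sum_{i=1}^{n} [h(1_i, E\mathbf{p}) - h(0_i, E\mathbf{p})]^2\, \mathrm{Var}(p_i).$$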
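The first-order approximation can be sketched in a few lines of Python, assuming the reliability function h is available as a callable and the arc parameters are judged independent. The function name `first_order_variance` and the two-arc series example with its means and variances are hypothetical.

```python
from typing import Callable, Sequence


def first_order_variance(h: Callable[[Sequence[float]], float],
                         mean_p: Sequence[float],
                         var_p: Sequence[float]) -> float:
    """Sum over arcs of [h(1_i, Ep) - h(0_i, Ep)]**2 * Var(p_i)."""
    total = 0.0
    for i, var_i in enumerate(var_p):
        hi = list(mean_p)
        lo = list(mean_p)
        hi[i] = 1.0          # arc i certainly works
        lo[i] = 0.0          # arc i certainly fails
        total += (h(hi) - h(lo)) ** 2 * var_i
    return total


def series_h(p):
    """Hypothetical network: two arcs in series, h(p) = p1 * p2."""
    return p[0] * p[1]


approx = first_order_variance(series_h, [0.9, 0.8], [0.01, 0.02])

# Exact variance for this toy case, for comparison:
# E[p1^2] E[p2^2] - (E p1 * E p2)^2 with E[p^2] = Var + mean^2.
exact = (0.01 + 0.9 ** 2) * (0.02 + 0.8 ** 2) - (0.9 * 0.8) ** 2
```

In this toy case the approximation gives 0.0226 against an exact value of 0.0228; the gap is the second-order term Var(p1) Var(p2).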

Better approximations can be obtained by including
more terms. If efficient (i.e., linear or polynomial
time) methods can be found for computing
$h(1_I, 0_J, E\mathbf{p})$, this will be a practical
method for approximating $\mathrm{Var}_\pi[h(\mathbf{p})]$.

4.3  Efficient Calculation of $h(E\mathbf{p})$

There have been recent breakthroughs in the
development of efficient algorithms for computing
$h(E\mathbf{p})$ when $\perp_{i=1}^{n} X_i \mid \mathbf{p}$
and $\perp_{i=1}^{n} p_i \mid \pi(\cdot)$. These apply to
planar networks of special structure, both directed
and undirected.

Consider the undirected graph in Figure 2. Efficient
methods now exist to calculate the probability that
all nodes in such a graph can communicate (cf.
Politof and Satyanarayana (1984)). The graph may be
represented by an adjacency matrix or an adjacency
structure [the list of all successors (neighbors)
of each vertex].
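The two representations can be compared with a short sketch. The helper names and the four-vertex cycle are hypothetical; the graph of Figure 2 is not reproduced here. The adjacency structure uses storage proportional to the number of arcs, whereas the matrix always uses |V|^2 entries.

```python
def adjacency_structure(edges):
    """Map each vertex to the list of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def adjacency_matrix(edges, vertices):
    """Return a symmetric 0/1 matrix indexed in the order of `vertices`."""
    index = {v: k for k, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix


# Hypothetical example: a cycle on four vertices.
CYCLE = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
adj = adjacency_structure(CYCLE)
mat = adjacency_matrix(CYCLE, ["a", "b", "c", "d"])
```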


Figure  2:  Example  Graph 


The approach used is roughly as follows:

(i)  Decompose the graph into its triconnected
components (a graph is triconnected if it has no
separation pairs of vertices).

(ii)  Find a planar embedding for each triconnected
component.

(iii)  If (ii) is successful, use Δ-Y and
polygon-to-chain probability reductions to determine
the reliability of each component.

(iv)  Calculate the reliability of the network using
the calculated reliabilities of each of its
components.
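The notion of a separation pair in step (i) can be illustrated with a brute-force check. This sketch uses a simplified definition, assumed here for illustration: a pair of vertices whose removal disconnects a simple undirected graph. It tests every pair directly, which is far slower than the decomposition algorithms such an approach would rely on. The function names and example graphs are hypothetical.

```python
from itertools import combinations


def is_connected(adj, removed):
    """Check connectivity of the graph after deleting `removed` vertices."""
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)


def separation_pairs(adj):
    """List vertex pairs whose removal disconnects the graph."""
    return [pair for pair in combinations(sorted(adj), 2)
            if not is_connected(adj, set(pair))]


# A four-cycle has separation pairs; the complete graph K4 has none.
CYCLE4 = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
K4 = {v: [w for w in "abcd" if w != v] for v in "abcd"}
```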

The previous approach is currently being implemented
for the all terminal network reliability problem.
The associated algorithm has running time of order
$|V|^2$ where $|V|$ is the number of vertices or nodes.

Linear  running  time  computer  programs  are  now 
available  for  both  directed  and  undirected 
networks  when  the  graph  is  basically  series- 
parallel.  Figure  3  shows  a  series-parallel 
graph.  A  graph  is  basically  series-parallel 
if  it  can  be  reduced  to  a  tree  by  series  and 
parallel  replacements  (as  contrasted  with 
series  and  parallel  probability  reductions). 
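The reduction test just described can be sketched as follows, assuming a connected undirected graph given as an arc list. This is an illustrative quadratic-free but unoptimized version, not one of the linear-time programs referred to above, and the function name and examples are hypothetical. Parallel replacement is implicit because neighbors are stored in sets; series replacement removes a vertex of degree 2 and joins its two neighbors.

```python
from itertools import combinations


def is_basically_series_parallel(edges):
    """True if series and parallel replacements reduce the graph to a tree."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)   # sets merge parallel arcs
            adj.setdefault(v, set()).add(u)

    stack = [v for v in adj if len(adj[v]) == 2]
    while stack:
        v = stack.pop()
        if v not in adj or len(adj[v]) != 2:
            continue                          # stale entry, skip it
        a, b = adj.pop(v)                     # series replacement at v
        adj[a].discard(v)
        adj[b].discard(v)
        adj[a].add(b)                         # new arc a-b (merged if present)
        adj[b].add(a)
        stack.extend(w for w in (a, b) if len(adj[w]) == 2)

    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return n_edges == len(adj) - 1            # connected graph is a tree


TRIANGLE = [("a", "b"), ("b", "c"), ("c", "a")]
BRIDGE = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
K4 = list(combinations("abcd", 2))            # complete graph on 4 vertices
```

The triangle and the bridge network both reduce to a single arc, while the complete graph on four vertices admits no replacement at all and so is not basically series-parallel.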


Figure  3:  Series  and  Parallel  Replacements 


Figure 4 shows a graph which is not basically
series-parallel.


Figure  4:  Example  Network  which  is  not 
Basically  Series-Parallel 


4.4  Available Computer Programs

No single computer program is available for
efficiently computing the reliability of arbitrary
planar networks. However, the following list covers
special structures of practical interest.


1.  General Directed Graphs - (2 terminal).
TOPOLOGICAL RELIABILITY ANALYSIS PROGRAM
Pascal (Berkeley Pascal PI - Version 2.0)
DON, SIYI March 1981.

2.  Undirected Basically Series Parallel Graphs.
POLYCHAIN - (K terminal)
Fortran IV (standard)
Mauricio G. C. Resende
October 1983.

3.  Directed Basically Series Parallel Graphs.
NETWORK
Pascal (Berkeley Pascal PI - Version 3.0)
Jack Shalo
December 1983
2-terminal.

FOOTNOTES

*
This research was supported by the Air Force Office
of Scientific Research (AFSC), USAF, under Grant
AFOSR-81-0122 with the University of California.
Reproduction in whole or in part is permitted for
any purpose of the United States Government.

^Supported by the Office of Naval Research under
Contract N00014-77-C-0263 Project NR-042-372, and
the Naval Air Systems Command under Contract
N00019-83-C-0358 with the George Washington
University.

REFERENCES

[1]  Agrawal, A. and Barlow, R. E., A Survey of
Network Reliability, ORC 83-5, Operations Research
Center, University of California, Berkeley (July
1983); to appear in Operations Research.

[2]  Avizienis, A., Fault-tolerance: The survival
attribute of digital systems, Proceedings of the
IEEE (1978) 1109-1125.

[3]  Bavuso, S. J., A User's Overview of CARE III,
NASA Langley Research Center, Hampton, VA (1984).

[4]  Bernhard, R., The 'no-downtime' computer, IEEE
Spectrum (September 1980) 33-37.

[5]  Claudio, L. F. and Kaspersen, D. L., Example
applications of software reliability estimation,
Proceedings of the 1981 Army Numerical Analysis and
Computers Conference (1981) 41-58.

[6]  Crow, L. H. and Singpurwalla, N. D., An
empirically developed Fourier series model for
describing software failures, to appear in IEEE
Trans. Reliability.

[7]  Forman, E. H. and Singpurwalla, N. D., An
empirical stopping rule for debugging and testing
computer software, J. Amer. Statis. Assoc. 72 (1977)
750-757.


[8]  Forman, E. H. and Singpurwalla, N. D., Optimal
time intervals for testing hypothesis on computer
software errors, IEEE Trans. Reliability R-28 (1979)
250-253.

[9]  Goel, A. L. and Okumoto, K., Time dependent
error detection rate model for software reliability
and other performance measures, IEEE Trans.
Reliability R-28 (1979) 206-211.

[10]  Horigome, M., Singpurwalla, N. D. and Soyer, R.,
A Bayes empirical Bayes approach for (software)
reliability growth, Proceedings of the 16th Computer
Science and Statistics Symposium on the Interface
(L. Billard, Ed.) (Atlanta, Georgia, March 14-16,
1984), North Holland, to appear.

[11]  Jelinski, Z. and Moranda, P. B., Software
reliability research, in Statistical Computer
Performance Evaluation (W. Freiberger, Ed.)
(Academic Press, New York, 1972).

[12]  Kremer, W., Birth-death and bug counting, IEEE
Trans. Reliability R-32 (1983) 37-46.

[13]  Kyparisis, J. and Singpurwalla, N. D., Bayesian
inference for the Weibull process with applications
to assessing software reliability growth and
predicting software failures, Proceedings of the
16th Computer Science and Statistics Symposium on
the Interface (L. Billard, Ed.) (Atlanta, Georgia,
March 14-16, 1984), North Holland, to appear.

[14]  Langberg, N. and Singpurwalla, N. D.,
Unification of some software reliability models,
SIAM J. Sci. Statis. Comp. (1982), to appear.

[15]  Littlewood, B. and Verrall, J. L., A Bayesian
reliability growth model for computer software,
Record IEEE Symp. Comp. Software Reliability (1973)
70-77.

[16]  Meinhold, R. J. and Singpurwalla, N. D.,
Bayesian analysis of a commonly used model for
describing software failures, The Statistician 32
(1983) 168-173.

[17]  Politof, T. and Satyanarayana, A., An O(|V|^2)
Algorithm for a Class of Planar Graphs to Compute
the Probability that the Graph is Connected,
Operations Research Center, University of
California, Berkeley (1984).

[18]  Resende, M., A Computer Program for Reliability
Evaluation of Large-Scale Undirected Networks Via
Polygon-to-Chain Reductions, ORC 83-10, Operations
Research Center, University of California, Berkeley
(1983).

[19]  Satyanarayana, A., A unified formula for the
analysis of some network reliability problems, IEEE
Trans. Reliability R-31 (April 1982) 23-32.

[20]  Satyanarayana, A. and Agrawal, A., Source to K
Terminal Reliability Analysis of Rooted
Communication Networks, ORC 83-2, Operations
Research Center, University of California, Berkeley
(February 1983); submitted for publication to
Networks.

[21]  Satyanarayana, A. and Wood, R. K.,
Polygon-to-Chain Reductions and Network Reliability,
ORC 82-4, Operations Research Center, University of
California, Berkeley (March 1982); submitted for
publication to SIAM J. Computing.

[22]  Trivedi, K. S. and Geist, R. M., A Tutorial on
the CARE III Approach to Reliability Modeling, NASA
Langley Research Center, Hampton, VA (1981).




